Abstract

Intensive English Program (IEP) Instructors and content faculty both listen to 
international students at the university. For these two groups of instructors, this study 
compared perceptions of international student speech by collecting comprehensibility
ratings and transcription samples for intelligibility scores. No significant differences 
were found between the two groups, suggesting that the perceptions of these two groups 
are reasonably well-matched. Seven linguistic features were assessed, and grammatical 
accuracy was found to have the strongest effect on content faculty’s ratings of 
comprehensibility. None of the linguistic features correlated significantly with 
intelligibility. These results raise questions about IEP assessment practices for speaking. 
Rankings of speech samples based on comprehensibility, intelligibility, and linguistic 
features all yielded different pictures of who was most understandable, highlighting the 
implications of using different criteria to assess student speech. 
Introduction

Over the last decade, international 
student enrollment increased markedly at 
many universities in the USA. In the 
2015-2016 academic year, 54 US 
universities reported more than 10% 
international students (US News and 
World Report, 2016), up from 38 two 
years earlier. Given this situation, issues 
of mutual comprehensibility between 
international students and their 
professors and classmates have gained 
salience in discussions of pedagogy and
policy. At one US university, a recent
survey of instructional faculty found that 
73% of respondents expressed concern 
about the oral communication skills of 
international undergraduates, with “an 
overwhelming sense that non-native 
speaking students are taking courses 
before their English language proficiency 
is adequate” (Evans, 2014, p. 2). When 
international students are underprepared 
in language, it can lead to negative 
outcomes such as a loss of confidence for 
students and a lack of respect for the
skills and knowledge these students bring 
to their universities (Ryan & Viete,
2009).

As instructors in a university-based 
Intensive English Program (IEP), the first 
two authors prepare international 
students to communicate in English at 
the university. When teaching speaking 
skills, we began to wonder how well our 
perceptions and 
assessments of our 
students’ speaking 
matched the 
perceptions of the 
instructional faculty 
who would work with 
our students after they 
exited the IEP. If there 
were a mismatch in 
perceptions, it could 
lead to misdirected 
instruction and wasted effort. We 
wondered whether some specific features 
of our students’ spoken English might 
have a greater or lesser effect on 
perceptions of comprehensibility among 
university instructors outside of the IEP. 
If this were the case, we could place a 
greater emphasis on developing those 
features before students matriculate to 
the university. Thus, our research 
questions were: 

1. Do university ESL instructors’
perceptions of international 
student speech match the 
perceptions of university faculty? 

2. What features of spoken language 
should we emphasize in order to 
best prepare students to meet the 
expectations of university 
faculty? 


Literature 

Comprehensibility and intelligibility 
both refer to how well speech can be 
understood. They are sometimes used 
interchangeably, but here we use them in 
their narrow definitions based on Munro 
& Derwing (1995). Comprehensibility is 
defined as a listener’s subjective rating of 
how easily they could understand. It is 
conventionally measured with a nine- 
point Likert scale. 
Intelligibility, 
however, is defined as 
a somewhat more 
objective measure of 
how much the listener 
can actually 
understand the 
speaker’s message. It 
is often measured via 
transcription 
accuracy, but 
sometimes other techniques such as 
questions about the content of the audio 
text are used. 

It is apparent from these definitions 
that both comprehensibility and 
intelligibility depend on the listener as 
well as on the speaker. They are 
measures of listener perception, not 
features of speech, and have been found 
to correlate with various features of the 
listener such as familiarity with the topic 
(Gass & Varonis, 1984; Kennedy & 
Trofimovich, 2008), familiarity with the 
accent or first language (L1) of the
speaker (Bradlow & Bent, 2008;
Derwing & Munro, 1997), general
familiarity with language learner speech 
(Kennedy & Trofimovich, 2008; Baese- 
Berk, Bradlow & Wright, 2013), and 
attitudes towards the speaker 
(Lindemann, 2002; Kang & Rubin,
2009).


A variety of studies have investigated
what features of speech correlate with 
higher ratings for comprehensibility and 
intelligibility, generally using native 
speakers as listeners. The following 
features have been found to have a 
statistically significant relationship with 
comprehensibility ratings: 

• Word stress (Isaacs & 
Trofimovich, 2012; Hahn, 2004) 

• Grammatical accuracy (Derwing 
& Munro, 1997; Isaacs & 
Trofimovich, 2012)

• Lexical richness (Isaacs & 
Trofimovich, 2012) 

• Fluency (Isaacs & Trofimovich, 
2012) and speaking rates (Kang, 
2010; Munro & Derwing, 2001). 

The following features have been 
found to have an effect on intelligibility 
scores: 

• Word stress accuracy (Field, 
2005; Hahn, 2004) 

• Phonemic accuracy in strong 
syllables (Zielinski, 2006) 

• Grammatical accuracy (Derwing 
& Munro, 1997). 

Finally, important literature criticizes 
approaches to measuring 
comprehensibility and intelligibility that 
assume the perceptions of native 
speakers are the only or most appropriate 
standard by which to judge the speech of 
non-native speakers. A more appropriate 
standard might be mutual 
comprehensibility among non-native 
speakers of English (Jenkins, 2002; 
Murphy, 2014), or among subject groups 
with actual communication needs (e.g. 
undergraduate students and their 
professors). For this reason, the current
study compares perceptions of
international students’ speech in two 
target audiences (ESL instructors and 
Content Faculty at the same university) 
without regard for the L1 of these
listeners. 

Background 

In a previous study (Sheppard,
Elliott, & Baese-Berk, 2017), two online
surveys collected responses from 
different groups of instructors at a U.S. 
university. The first group included 
instructors in any field except language; 
we referred to this group as Content 
Faculty. The second group included ESL 
instructors who currently or recently 
taught speaking skills in the university 
IEP. The surveys were slightly different 
for each group. 

Both surveys included 10 classroom 
recordings of IEP students in their last 
weeks before entering the university. The 
students spoke spontaneously for 1-2 
minutes in response to a question. 

Content Faculty were asked to rate the 
comprehensibility of these speeches on a 
9-point Likert scale with directions to 
assign a 9 if the speech was very easy to 
understand, a 5 if the speech was 
completely comprehensible given 
significant special effort, and a 1 if the 
speech was mostly incomprehensible 
even with extra effort (see Figure 1). 

ESL Instructors were asked to rate 
overall comprehensibility using the same 
directions, and were also asked to rate six 
language features (vowel pronunciation, 
consonant pronunciation, stress/rhythm, 
intonation, grammatical accuracy, 
fluency) of the speech on the same 9- 
point Likert scale, with directions to rate 
the degree to which each specific aspect 
was a cause of comprehensibility 
challenges. Both groups were instructed
to listen only once, and both groups rated
all 10 speakers in random order.

Figure 1 - Screen shot from Content Faculty Survey
[Screen shot of one survey item: "Speaker 3", the question "What is the most important quality of a good person?", an audio player, and a 1-9 rating scale for the prompt "How easy was this speaker for you to understand?" Scale anchors: 1 = mostly incomprehensible, even with significant special effort; 5 = completely comprehensible, but requires significant special effort; 9 = as easy to understand as a native speaker.]

Participants in both surveys were 
then presented with short excerpts from 
the same 10 student speech samples, 
once again in random order. These were 
excerpted from the same recordings by 
selecting the first complete sentence of 
appropriate length (4-6 content words). 
Survey participants were asked to listen 
once and type what they heard. The 
resulting transcripts were then coded as a 
match or a mismatch for each content 
word, and trivial errors such as 
regularizations and substitution of 
equivalent forms were disregarded. After 
one researcher coded all the transcription 
data, the other researcher coded 10% of 
the data. The two researchers agreed on 
241 out of 242 content words in this
sample, or 99.59% agreement. Our
measure of intelligibility was the 
proportion of words correctly 
transcribed, represented as a score of 0-1 
(e.g. if 87% of words were correctly 
transcribed, the intelligibility score was 
0.87). 
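The scoring arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration rather than the coding procedure the researchers used: the function names and the list-of-booleans input are assumptions, and the match/mismatch judgments themselves (including which errors count as trivial) are taken as already made by a human coder.

```python
def intelligibility_score(coded_words):
    """Proportion of content words coded as correctly transcribed (0-1).

    coded_words: list of booleans, one per content word in the excerpt;
    True means the coder judged the transcribed word a match.
    (Hypothetical input format, assumed for illustration.)
    """
    if not coded_words:
        raise ValueError("excerpt must contain at least one content word")
    return sum(coded_words) / len(coded_words)


def percent_agreement(agreed, total):
    """Simple percent agreement between two coders."""
    return 100 * agreed / total


# A listener who matched 4 of 5 content words scores 0.8.
example_score = intelligibility_score([True, True, False, True, True])

# The agreement figure reported in the text: 241 of 242 content words.
agreement = percent_agreement(241, 242)
print(example_score, round(agreement, 2))
```

Simple percent agreement does not correct for chance agreement, which matters little here because disagreement was nearly absent.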

In this previous study, there was no 
significant overall difference between the 
two groups of participants in either 
comprehensibility or intelligibility. This 
result bears on research question 1, and 
we will return to it below. 

ESL Instructors’ ratings of the six 
language features were highly inter- 
correlated in the previous study. The four 
pronunciation features had Pearson 
correlations between 0.78 and 0.96, and 
grammar and fluency also were 
significantly correlated with many
pronunciation features. Since we could 
not be sure that these scores represented 
separate constructs, reports of these 
features were omitted 
from the published paper 
(Sheppard, Elliott, &
Baese-Berk, 2017). We
hypothesized that this 
result may have arisen 
because instructors were 
only allowed to listen 
once. Scoring six 
separate language 
features in one hearing may be an 
excessive demand that resulted in a halo 
effect (the tendency for scores on 
separate items to be based on a general 
impression of skill, rather than on the 
actual criteria to which they are supposed 
to refer). It should be noted that scoring 
rubrics with six dimensions for use in 
classroom situations are not altogether 
uncommon in our profession. We will 
discuss our update to this portion of the 
study below. 

Methods 

Due to our limited confidence in the 
scores for individual features in the 
previous study, we conducted a follow-up
study. We applied for and were
awarded a Marge Terdall Research Grant 
from ORTESOL to hire five expert 
raters, who provided careful ratings of 
the speech samples used in the previous 
study. Raters were ESL instructors 
specializing in speaking instruction, 
selected from among the leadership of 
our IEP. Instead of listening just once, 
they had 15 minutes to rate each
1-to-2-minute speech segment.


One additional feature (lexical 
accuracy) was added to the analysis, for a 
total of 7 dimensions: vowel 
pronunciation, consonant pronunciation, 
stress/rhythm, 
intonation, grammatical 
accuracy, lexical 
accuracy, and fluency. 
Raters were also given 
space to write 
comments on each 
speaker. 

These new ratings 
of language features 
were combined with survey data from the 
study described in “Background” above. 

Results and discussion 

For each speaker and language 
feature, five expert raters provided scores 
on a scale of 1-9 (with the same 
definitions of end and middle points as in 
the previous survey). The five raters were 
quite consistent in their assignment of 
scores. As a measure of inter-rater 
reliability, Cronbach’s alpha was
calculated separately for ratings of each 
feature of the students’ speech, with 
results ranging from 0.840 to 0.915. The 
mean interrater reliability for all seven 
features was 0.860 (SD 0.037). 
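For readers unfamiliar with the statistic, the sketch below shows one common way to compute Cronbach's alpha when each rater is treated as an "item" and each speaker as a case. It is an illustration under stated assumptions: the function name is ours, the demonstration scores are invented (they are not data from this study), and the original analysis was presumably run in a statistics package.

```python
from statistics import variance


def cronbach_alpha(ratings_by_rater):
    """Cronbach's alpha with raters treated as items.

    ratings_by_rater: list of k lists, one per rater, each holding that
    rater's scores for the same n speakers in the same order.
    """
    k = len(ratings_by_rater)
    if k < 2:
        raise ValueError("need at least two raters")
    # Sample variance of each rater's scores across speakers.
    rater_variances = [variance(scores) for scores in ratings_by_rater]
    # Sample variance of the per-speaker totals (summed over raters).
    totals = [sum(scores) for scores in zip(*ratings_by_rater)]
    total_variance = variance(totals)
    return (k / (k - 1)) * (1 - sum(rater_variances) / total_variance)


# Invented scores for 3 raters x 4 speakers (NOT data from the study).
demo = [
    [4, 5, 6, 7],
    [5, 5, 7, 8],
    [4, 6, 6, 8],
]
demo_alpha = cronbach_alpha(demo)
print(round(demo_alpha, 3))
```

Alpha approaches 1 when raters order and space the speakers in the same way, which is why values between 0.840 and 0.915 are read here as consistent scoring.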

For each speaker (n=10), a mean 
score was taken from the five raters to 
represent perception of that speaker in 
each of the seven language features. 
These means (presented in Table 1) were 
entered into subsequent analyses as 
language feature ratings. Results were 
analyzed for inter-correlation between 
the features, and the results are presented 
in Table 2. 


Table 1 - Mean language feature ratings for each individual speaker

                          S1    S2    S3    S4    S5    S6    S7    S8    S9    S10
Vowel pronunciation       5.6   4     5     4     5.8   5.6   4.4   5.8   5.4   3.6
Consonant pronunciation   5.2   3.6   4.6   5.2   6.4   4.6   4.4   6.2   5.6   3.4
Stress/rhythm             4.8   3.2   4.6   4.8   5.4   5     4.4   5.8   4.8   4.2
Intonation                4.6   4.6   4.8   5.6   6.2   4.6   4.8   6.8   4.8   4.8
Grammatical accuracy      4.4   4.8   4.4   5     5.4   6.4   5     5.6   5     4.4
Lexical accuracy          4.4   4.8   4.8   5.2   5.2   6.6   4.6   5.4   5     4.8
Fluency                   4.4   4.6   4.4   5.8   5.4   6.4   5     7     4.6   5


Table 2 - Pearson's correlations for expert ratings of language features. * p < .05;
** p < .01

                          Vowel      Consonant  Stress/                Grammat.   Lexical
                          pronunc.   pronunc.   rhythm     Intonation  accuracy   accuracy
Consonant pronunciation   .782**
Stress/rhythm             .751*      .857**
Intonation                .368       .735*      .711*
Grammatical accuracy      .490       .372       .492       .339
Lexical accuracy          .358       .186       .404       .185        .911**
Fluency                   .303       .401       .631       .673*       .795**     .741*
Inter-correlations between ratings of
the seven language features were weaker
than in the survey results from the
previous study, but still significant in
many cases. In particular, the four
pronunciation features (vowels,
consonants, stress, and intonation) were
significantly correlated with each other
in every pairing except vowels and
intonation, and the three other features
(grammatical and lexical accuracy,
fluency) were also significantly
correlated with each other. Unlike in the
previous study’s survey results, these two
groups of features were distinct from
each other, with no statistically
significant correlation between the
groups taken as a whole (r = .493,
p = .147).
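The pairwise correlations can be recomputed directly from the per-speaker means in Table 1. The sketch below is a plain-Python illustration, not the authors' analysis script: the dictionary keys and function name are ours, and the data are transcribed from Table 1.

```python
from math import sqrt

# Mean expert ratings for speakers S1..S10, transcribed from Table 1.
FEATURES = {
    "vowel":      [5.6, 4, 5, 4, 5.8, 5.6, 4.4, 5.8, 5.4, 3.6],
    "consonant":  [5.2, 3.6, 4.6, 5.2, 6.4, 4.6, 4.4, 6.2, 5.6, 3.4],
    "stress":     [4.8, 3.2, 4.6, 4.8, 5.4, 5, 4.4, 5.8, 4.8, 4.2],
    "intonation": [4.6, 4.6, 4.8, 5.6, 6.2, 4.6, 4.8, 6.8, 4.8, 4.8],
    "grammar":    [4.4, 4.8, 4.4, 5, 5.4, 6.4, 5, 5.6, 5, 4.4],
    "lexical":    [4.4, 4.8, 4.8, 5.2, 5.2, 6.6, 4.6, 5.4, 5, 4.8],
    "fluency":    [4.4, 4.6, 4.4, 5.8, 5.4, 6.4, 5, 7, 4.6, 5],
}


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_gram_lex = pearson_r(FEATURES["grammar"], FEATURES["lexical"])
r_gram_flu = pearson_r(FEATURES["grammar"], FEATURES["fluency"])
print(round(r_gram_lex, 3), round(r_gram_flu, 3))
```

For grammatical and lexical accuracy this gives r of about .911, and for grammatical accuracy and fluency about .795, the values shown in Table 2.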

Research Question 1: Do ESL
instructors’ perceptions of international
student speech match the perceptions of
university faculty?

For comprehensibility ratings, ESL 
Instructor ratings were significantly 
correlated with Content Faculty’s ratings 
(r = .665, p=.036). For intelligibility 
scores, the two groups were even more 
highly correlated (r = .942, p<.001). 
These results indicate no reason to 
suspect a major mismatch between 
perceptions of ESL Instructors and 
Content Faculty when listening to 
international students - an encouraging 
outcome, since it suggests that the two 
groups may evaluate student speech 
using similar implicit standards and 
should be able to communicate about 
students’ language needs. 
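The significance of a Pearson correlation is conventionally tested by converting r to a t statistic with n - 2 degrees of freedom. The sketch below applies that conversion to the two correlations just reported. It assumes n = 10 (one value per speech sample) and a two-tailed test, neither of which is stated explicitly in the text; the critical value is the standard tabled one for df = 8.

```python
from math import sqrt

# Two-tailed critical value of Student's t for alpha = .05 and df = 8,
# taken from standard t tables.
T_CRIT_05_DF8 = 2.306


def t_from_r(r, n):
    """t statistic for testing H0: rho = 0, with df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# Assumed sample size: the 10 speech samples.
t_compre = t_from_r(0.665, 10)   # comprehensibility ratings
t_intell = t_from_r(0.942, 10)   # intelligibility scores

print(round(t_compre, 2), t_compre > T_CRIT_05_DF8)
print(round(t_intell, 2), t_intell > T_CRIT_05_DF8)
```

Under these assumptions the comprehensibility correlation yields t of roughly 2.52, just above the critical value, which is consistent with the reported p = .036.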


Although the two subject groups had 
broadly similar perceptions of student 
speech, the different criteria used to 
evaluate understandability 
(comprehensibility scores, intelligibility 
transcriptions, and ratings of language 
features) resulted in different evaluations 
of which students were 
more understandable. The 
10 speech samples were 
ranked from highest to 
lowest scores for each set 
of criteria, giving an 
impression of which 
speakers were perceived 
to be more
understandable according
to the different criteria. 
For language features, the mean of all 
seven feature scores was used for this 
ranking. In Table 3, speech samples 
(identified as S1-S10) are arranged in
rank order according to each of these 
three criteria, with those at the top 
perceived as the most understandable and 
those at the bottom the least 
understandable. 
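The language-feature ranking can be reproduced from Table 1 by averaging each speaker's seven feature ratings and sorting. The following is a minimal sketch; the variable names are ours, and the ratings are transcribed from Table 1 in the order vowel, consonant, stress/rhythm, intonation, grammar, lexical, fluency.

```python
# Expert ratings per speaker, transcribed from Table 1.
RATINGS = {
    "S1":  [5.6, 5.2, 4.8, 4.6, 4.4, 4.4, 4.4],
    "S2":  [4, 3.6, 3.2, 4.6, 4.8, 4.8, 4.6],
    "S3":  [5, 4.6, 4.6, 4.8, 4.4, 4.8, 4.4],
    "S4":  [4, 5.2, 4.8, 5.6, 5, 5.2, 5.8],
    "S5":  [5.8, 6.4, 5.4, 6.2, 5.4, 5.2, 5.4],
    "S6":  [5.6, 4.6, 5, 4.6, 6.4, 6.6, 6.4],
    "S7":  [4.4, 4.4, 4.4, 4.8, 5, 4.6, 5],
    "S8":  [5.8, 6.2, 5.8, 6.8, 5.6, 5.4, 7],
    "S9":  [5.4, 5.6, 4.8, 4.8, 5, 5, 4.6],
    "S10": [3.6, 3.4, 4.2, 4.8, 4.4, 4.8, 5],
}

# Mean of the seven feature ratings for each speaker.
mean_rating = {spk: sum(vals) / len(vals) for spk, vals in RATINGS.items()}

# Speakers ordered from highest to lowest mean rating.
ranking = sorted(mean_rating, key=mean_rating.get, reverse=True)
print(ranking)
```

S3 and S7 have the same mean in Table 1 (their seven ratings both sum to 32.6), so their relative order in positions 7 and 8 is arbitrary.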

Within each type of rating/score, the
two sets of rankings look fairly similar,
and indeed, the rankings from the two
groups are statistically correlated
within each criterion. In other words,
Content Faculty and ESL Instructors tend 
to agree about which speech samples are 
the most/least comprehensible (except 
for S9, who was removed from these 
analyses as an outlier) and about which 
speech samples are the most/least 
intelligible. Similarly, ESL instructors in 
the previous study tended to agree with 
expert raters in the current study about 
which speech samples had strong/weak 
language feature ratings. 
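Agreement between two rank orders can be quantified with a rank correlation. The text does not name the statistic used, so the sketch below assumes Spearman's rho, in its simple form for complete rankings without ties, and applies it to the two intelligibility columns of Table 3.

```python
def spearman_rho(order_a, order_b):
    """Spearman's rho for two complete rankings of the same items, no ties.

    order_a, order_b: lists of item labels from highest to lowest rank.
    """
    n = len(order_a)
    rank_a = {item: i for i, item in enumerate(order_a, start=1)}
    rank_b = {item: i for i, item in enumerate(order_b, start=1)}
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Intelligibility rankings transcribed from Table 3 (highest first).
content_intell = ["S3", "S2", "S6", "S4", "S5", "S7", "S1", "S9", "S8", "S10"]
esl_intell = ["S4", "S3", "S6", "S7", "S2", "S5", "S1", "S9", "S8", "S10"]

rho_intell = spearman_rho(content_intell, esl_intell)
print(round(rho_intell, 3))
```

The two groups place the same four samples at the bottom in the same order, and the differences in the upper ranks are small, so rho comes out at about .85.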


Table 3 - Speech samples 1-10 ordered according to rating/score from highest to lowest

              Comprehensibility    Intelligibility     Mean rating of 7
              Ratings              Scores              language features
Ranking       Content    ESL       Content    ESL      Survey    Paid raters
1 (highest)   S6         S6        S3         S4       S8        S8
2             S4         S5        S2         S3       S5        S5
3             S8         S4        S6         S6       S6        S6
4             S9         S7        S4         S7       S4        S4
5             S7         S8        S5         S2       S9        S9
6             S5         S2        S7         S5       S7        S1
7             S3         S1        S1         S1       S3        S7
8             S1         S10       S9         S9       S2        S3
9             S2         S3        S8         S8       S10       S10
10 (lowest)   S10        S9        S10        S10      S1        S2


Between the three types of 
ratings/scores, however, greater 
differences in rankings appear. Rankings 
based on comprehensibility showed some 
similarity with rankings based on 
language features, although this was 
weaker than the similarity within each 
type of rating reported above. Rankings
based on intelligibility scores did not 
have any significant correlation with 
rankings based on either 
comprehensibility or language features. 
This suggests that intelligibility 
(transcription accuracy) is a clearly 
different construct from 
comprehensibility and feature ratings. 

The relationship between overall
comprehensibility and the mean of
language feature ratings in the ranked 
evaluation of speech samples is less 
clear, but the visible difference in 
rankings gives reason to suspect that 
comprehensibility, while related to the 
seven language features analyzed here, 
also references other information. This
result demonstrates the importance of
clearly defining terms and criteria when
discussing whether a speaker is “easy to 
understand.” 

Research Question 2: What features of 
spoken language should we emphasize, 
in order to best prepare students to meet 
the expectations of university faculty? 
Results from expert raters’ analysis
of the 10 speech samples were compared
to Content Faculty survey results for
comprehensibility and intelligibility.
Findings are displayed in Table 4.


Table 4 - Pearson’s correlations between language feature ratings and Content Faculty
comprehensibility and intelligibility. * = p < .05, ** = p < .01

                            Vowel   Consonant   Stress/Rhy   Intonation   Grammar   Lexical   Fluency
Content Faculty Compre.     .312    .444        .559         .318         .769**    .739*     .701*
Content Faculty Intellig.   .043    -.050       -.202        -.242        .099      .077      -.184


Grammatical accuracy, lexical 
accuracy, and fluency ratings were 
significantly correlated with Content 
Faculty comprehensibility ratings. None 
of the other features correlated with 
either comprehensibility or intelligibility. 
A stepwise multiple regression showed
that grammatical accuracy had the
greatest effect on Content Faculty
comprehensibility, F(1,8) = 11.596,
p = .009. With a correlation of r = .769,
expert ratings for grammatical accuracy
accounted for about 59% of the variance
in comprehensibility ratings (R² = .59).
No other variables entered into the
equation, indicating that they did not
add to the statistical effect. It should be
remembered, however, that lexical 
accuracy and fluency were strongly inter- 
correlated with grammatical accuracy. 
These three features may not be acting as 
separate variables in our study. 
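When a regression retains a single predictor, the share of variance explained is the square of the Pearson correlation, and the F statistic follows from it as F(1, n - 2) = R²(n - 2) / (1 - R²). The sketch below checks the reported statistics against each other, assuming n = 10 speech samples and the grammatical accuracy correlation from Table 4; the function name is ours.

```python
def r_squared_and_f(r, n):
    """R-squared and F(1, n - 2) for a simple regression with one predictor."""
    r2 = r * r
    f_stat = r2 * (n - 2) / (1 - r2)
    return r2, f_stat


# r = .769 is the grammar/comprehensibility correlation in Table 4;
# n = 10 is assumed from the number of speech samples.
r_squared, f_stat = r_squared_and_f(0.769, 10)
print(round(r_squared, 3), round(f_stat, 2))
```

This gives R² of about .59 and F of about 11.58, close to the reported F(1,8) = 11.596; the small gap is what one would expect from r being rounded to three decimals.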

Implications for instruction and 
assessment of ESL students 

Comprehensibility is an important 
goal for IEP instruction, particularly 
when measured according to the 
perception of an actual target audience 
(in this case, content faculty). Listener 


perceptions of effort and difficulty in 
understanding can affect attitudes and 
willingness to listen. The finding that the 
Content Faculty group perceived speech 
as more easily comprehensible when it 
was rated as more grammatically 
accurate (in conjunction with higher 
ratings for lexical accuracy and fluency) 
aligns with previous work (Isaacs & 
Trofimovich, 2012). This might 
influence IEP speaking instructors to 
increase their emphasis on oral grammar, 
vocabulary, and fluency, perhaps 
reducing pronunciation instruction. 
Certainly, instruction for accuracy in 
grammar and vocabulary may support 
increased comprehensibility. It is less 
clear that pronunciation instruction 
should be reduced. Comprehensibility 
(perceived ease of understanding) is just 
one construct that affects communication 
between speakers and listeners, and 
pronunciation can affect communication 
in other ways. Aspects of pronunciation 
that were not captured in this study may 
also affect comprehensibility. 

Intelligibility is also an important 
goal for IEP instruction, since 
international students often need to make 
themselves explicitly understood when 
speaking to their instructors, peers, and
others. The fact that none of the language 
feature ratings examined here had a 
significant correlation with transcription 
accuracy raises questions about what 
features impact intelligibility and what 
methods of assessment can capture these 
features. The literature indicated that 
aspects of pronunciation can affect 
intelligibility. Our first study indicated 
that grammatical accuracy positively 
affected intelligibility while fluency 
negatively affected it. Further study is 
needed on both intelligibility and 
comprehensibility with a variety of target 
audiences. 

Finally, the inter-correlations that 
occurred when ESL teachers rated 
student language features can have 
implications for rubric design in 
speaking assessment. ESL teachers in 
this study had a hard time differentiating 
students’ strengths and weaknesses in 
vowel, consonant, stress, and intonation, 
even when they spent 15 minutes 
replaying each 1-2 minute recorded 
speech sample. Analytic rubrics used for 
classroom assessment sometimes include 
these elements, while in other instances, 
a single “pronunciation” score is 
included. This latter approach may be 
preferable in light of these findings. 

More significantly, grammar, 
vocabulary, and fluency were not clearly 
differentiated by expert raters in this 
study, and these are frequently 
represented as separate dimensions on 
classroom rubrics for speaking 
assessment. Of course, it may also be the 
case that the students whose speech 
samples were recorded really did vary in 
only two dimensions, that every student 
who had good grammar was also very 
fluent and made accurate vocabulary 
choices. It seems more likely, however, 


that something is amiss in teachers’ 
ability to separately assess these features. 
Perhaps such rubrics could be redesigned 
to focus instructors’ attention on very 
specific aspects of each feature (e.g. 
speaking rate and/or use of pauses, 
instead of fluency). Additionally, 
instructors should stay aware of the 
possibility of halo effects whenever they 
use complex analytical rubrics. 

Conclusion 

This study compared the perceptions 
of two actual audiences for international 
student speech: IEP instructors who 
prepare the students for university study, 
and the content faculty members with 
whom they work upon completion of 
their studies in the IEP. No significant 
differences between the perceptions of 
the two groups were found - a 
potentially encouraging result for 
collaboration between IEP instructors 
and content faculty who teach 
international students. Grammatical 
accuracy in association with lexical 
accuracy and fluency was found to have 
the most significant effect on faculty 
ratings of comprehensibility, while no 
individual feature’s ratings clearly 
predicted faculty’s intelligibility scores. 

There were several limitations in the 
design of this study, and a need for 
further research. Although clear 
instructions were given for the survey, 
researchers cannot guarantee that all 
subjects followed directions to use 
headphones and listen only once. If 
possible, it would be preferable to run the 
study in a controlled environment. 
However, the choice to use an online 
survey was based on the need to include 
busy professors at a research university. 
Two limitations may have contributed to 
the lack of significant correlations 
between expert ratings of language
features and target audience intelligibility.
First, intelligibility scores were based on 
excerpts taken from longer speeches, 
which were considered in their full 
length for ratings of language features. If 
the selected sentence was not 
representative of other sentences in the 
sample, it would weaken results. Second, 
language feature ratings were based on 
holistic rater impressions, while 
intelligibility scores were based on a 
quantitative measure for words 
transcribed. For some language features, 
quantitative measurement of the speech 
sample would be possible, and would be 
a useful comparison. The use of expert 
raters who are also ESL instructors, 
however, allows increased applicability 
to questions of ESL speaking assessment. 
Future studies should consider matching 


the length of comprehensibility and 
intelligibility samples, and analyzing 
speech samples for those features that 
can be quantitatively measured. 

Since challenges with comprehensibility 
and intelligibility can reduce 
international students’ academic 
confidence and reduce opportunities for 
everyone at the university to benefit from 
the cultural and content area knowledge 
of international students, it is important 
to understand how international students 
and their listeners succeed and fail to 
understand each other. Other universities 
may want to complete similar studies to 
compare aspects of student speech with 
the perceptions of various target 
audiences. 